

### Software Design and Integration for Embedded Multimedia Applications by Successive Refinement

#### **Katalin Popovici**

katalin.popovici@mathworks.com

The MathWorks, France

© 2008 The MathWorks, Inc.



### MATLAB® SIMULINK®

#### **Acknowledgement**

- TIMA (*Techniques of Informatics and Microelectronics for Computers Architecture*) Labs, INPG, CNRS
- Dr. Ahmed Jerraya, Head of Design Group, CEA-LETI, France
- Prof. Frederic Petrot, TIMA Labs, France
- Prof. Frederic Rousseau, TIMA Labs, France



#### **Summary**

MPSoC (<u>M</u>ulti-<u>P</u>rocessor <u>S</u>ystem <u>O</u>n <u>C</u>hip) integrates different components (hardware and software) on a single chip



#### Context:

- Heterogeneous MPSoC are required by current multimedia applications
  - E.g. TI OMAP, ST Nomadik, Philips Nexperia, Atmel Diopsis
  - DSP + μC + Sophisticated Communication Infrastructure
- Multiple Software (SW) Stacks

#### Problem:

- Classic programming environments do not fit:
  - High-level programming environments (e.g. C/C++, Simulink) cannot efficiently exploit architecture-specific capabilities
  - HW (virtual) prototypes are too detailed and too time-consuming for SW debugging



#### Challenge:

Efficient and Fast Programming Environment for Heterogeneous MPSoC

#### Proposal:

- SW development and validation environment using Simulink & SystemC
- Communication mapping exploration




#### **Outline**

- Introduction
- Software Design and Validation
  - System Architecture
  - Virtual Architecture
  - Transaction Accurate Architecture
- Conclusions

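#### **MPSoC Architecture**

[Figure: MPSoC architecture with hardware subsystems (HW-SS) and software subsystems (SW-SS) attached to an interconnect component; the software of each SW-SS consists of the application and the Hardware dependent Software (HdS)]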

- Heterogeneous MPSoC:
  - SW subsystems for flexibility
  - HW subsystems for performance
  - Complex communication network
    - Bus based architectures
    - Network on Chip (NoC) architectures
- SW Subsystem:
  - Specific CPU Subsystem:
    - CPUs: GPP, DSP, ASIP
    - I/O + memory architecture + other peripherals
  - Layered SW Architecture:
    - Application code (tasks)
    - Hardware dependent Software (HdS)
      - Specific to Architecture/Application to achieve efficiency



Goal: generate SW efficiently by using HW resources for communication and synchronization.



# **Example of Heterogeneous MPSoC: Reduced Atmel Diopsis RDT**

- ARM9 SS
- DSP SS
- MEM SS
- POT SS (Periph. On Tile)
  - I/O peripherals
  - System Peripherals
- Interconnect: AMBA bus



- Local & global memories accessible by both processing units
  - Different communication schemes between CPUs
- Requires multiple software stacks (ARM + DSP)




#### **Software Stack Organization**



- SW Stack is organized into layers:
  - Application code:
    - SW code of tasks mapped on the CPU
  - HdS (Hardware dependent Software) made of different components:
    - OS (Operating System)
    - Comm (Communication Primitives)
    - HAL (Hardware Abstraction Layer)
    - API (Application Programming Interface)
- Different SW components need to be validated incrementally
  - Different abstraction levels corresponding to the different SW components
  - SW development platforms (HW abstraction models) to allow specific SW components debug and communication refinement
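
The layering above can be illustrated with a minimal sketch in plain C++. It is not the HdS code used in the presentation: the `Hal`, `HostHal`, `Comm`, and `sum_task` names, the word-addressed memory, and the message layout (length word followed by payload) are all assumptions chosen for illustration. The point is only that application code calls the Comm primitives, and the Comm primitives call nothing but the HAL.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// HAL: the only layer that touches hardware. Hypothetical interface.
class Hal {
public:
    virtual ~Hal() = default;
    virtual void write_word(std::uint32_t addr, std::uint32_t value) = 0;
    virtual std::uint32_t read_word(std::uint32_t addr) const = 0;
};

// Host-side stand-in for a platform HAL: memory is simulated with a map.
class HostHal : public Hal {
public:
    void write_word(std::uint32_t addr, std::uint32_t value) override {
        mem_[addr] = value;
    }
    std::uint32_t read_word(std::uint32_t addr) const override {
        auto it = mem_.find(addr);
        return it == mem_.end() ? 0u : it->second;
    }
private:
    std::map<std::uint32_t, std::uint32_t> mem_;
};

// Comm layer: message primitives built only on HAL calls.
// Assumed layout: the word at `base` holds the length, the payload follows.
class Comm {
public:
    explicit Comm(Hal& hal) : hal_(hal) {}

    void send(std::uint32_t base, const std::vector<std::uint32_t>& msg) {
        const std::uint32_t n = static_cast<std::uint32_t>(msg.size());
        for (std::uint32_t i = 0; i < n; ++i)
            hal_.write_word(base + 4u * (i + 1u), msg[i]);
        hal_.write_word(base, n);  // publish the length last
    }

    std::vector<std::uint32_t> recv(std::uint32_t base) {
        const std::uint32_t n = hal_.read_word(base);
        std::vector<std::uint32_t> msg(n);
        for (std::uint32_t i = 0; i < n; ++i)
            msg[i] = hal_.read_word(base + 4u * (i + 1u));
        hal_.write_word(base, 0u);  // mark the buffer as consumed
        return msg;
    }
private:
    Hal& hal_;
};

// Application task code: it sees only the Comm API, never the HAL.
std::uint32_t sum_task(Comm& comm, std::uint32_t base) {
    std::uint32_t sum = 0;
    for (std::uint32_t w : comm.recv(base)) sum += w;
    return sum;
}
```

Porting this sketch to another platform would mean replacing `HostHal` only; `Comm` and `sum_task` stay unchanged, which is the portability argument made for the HAL layer.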


### **Software Development Platform**



- User SW Code
  - C/C++, Simulink functions, binary,...
- Development platform to abstract architectures
  - Runtime library, simulator (ISS, Simulink)
- Executable model generation
  - Compile, Link
- Debug
  - Iterative process
  - Different SW components need different detail levels
- Requirements for MPSoC executable models:
  - Speed
    - Easily experiment several mapping schemes
    - Multiple SW stacks
  - Accuracy
    - Evaluate the effect on performance by using specific HW resources
    - Debug low level SW code

#### **Software Design Flow**

[Figure: software design flow]

### Why Simulink?

- Environment well suited to complex algorithm modeling
- Rich library of predefined functional blocks
- Offers a set of algorithms blocks for a variety of applications
  - Signal Processing Blockset: FFT, DCT, IDCT, IFFT, ...
  - Video Processing Blockset: SAD, Edge Detection, PSNR, Block matching
- User defined blocks integration (S-Functions)
- Provides simulation, profiling and code generation facilities
  - Real-Time Workshop (RTW) for C code generation
  - HDL Coder for VHDL generation
- Open issue: Algorithm mapping and refining for MPSoC



#### Why SystemC?

- Standard System Level Design Language
  - Unified language for HW & SW development based on C++ extension
- Concurrency support: hardware modules
- Concept of time (clocks, delays with custom wait() calls)



- Communication model: signals, protocols
- Reactivity to events: support of events, sensitivity list
- Integrated Simulation Core for the Realization of Executable models
  - Modeling and Simulation within a wide range of abstraction levels
- Still a low-level design language
  - Complex systems are not easy to specify at the algorithm level






#### **System Architecture Model**





#### **System Architecture Design**



- Algorithm validation through simulation
- Explicit annotation for implementation

# Application Example: M-JPEG Decoder Mapped on Diopsis RDT





### System Architecture Level Software Development Platform for Diopsis RDT



Communication units: Simulink Signals

- 5 Intra SS communication units
  - depends on application
- 3 Inter SS communication units
  - depends on application
- Generic channels to be mapped on resources
- Execution model in Simulink

Architecture parameters annotating the model:

- ResourceType
  - ARM9, DSP, POT, Task
  - Communication: swfifo, dmem, sram, reg, dxm
- NetworkType: AMBA\_AHB, NoC
- AccessType: DMA, direct
- MemName
- Validation of application functionality
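
As an illustration of the annotation parameters listed above, the following C++ sketch mirrors them as plain types. It is not the format Simulink stores them in: the type names, the `ChannelScope` field, the `"dxm0"` memory name, and the validity rule are assumptions. The rule is only read off the mappings shown in this presentation (software FIFOs for intra-SS channels; dmem, sram, reg, or dxm for inter-SS channels).

```cpp
#include <cassert>
#include <string>

// Hypothetical C++ mirror of the annotation parameters; all names are illustrative.
enum class ResourceType { ARM9, DSP, POT, Task, SwFifo, Dmem, Sram, Reg, Dxm };
enum class NetworkType { AMBA_AHB, NoC };
enum class AccessType { DMA, Direct };
enum class ChannelScope { IntraSS, InterSS };

struct ChannelAnnotation {
    ChannelScope scope;
    ResourceType resource;   // communication resource the channel is mapped on
    NetworkType network;
    AccessType access;
    std::string mem_name;    // MemName; the value format is an assumption
};

// Assumed rule: intra-SS channels use software FIFOs,
// inter-SS channels use dmem, sram, reg, or dxm.
bool is_valid_mapping(const ChannelAnnotation& a) {
    switch (a.resource) {
        case ResourceType::SwFifo:
            return a.scope == ChannelScope::IntraSS;
        case ResourceType::Dmem:
        case ResourceType::Sram:
        case ResourceType::Reg:
        case ResourceType::Dxm:
            return a.scope == ChannelScope::InterSS;
        default:
            return false;  // ARM9, DSP, POT, Task are not communication resources
    }
}

// Example: an inter-SS channel mapped on DXM, accessed directly over AMBA AHB.
ChannelAnnotation example_inter_ss_channel() {
    return {ChannelScope::InterSS, ResourceType::Dxm, NetworkType::AMBA_AHB,
            AccessType::Direct, "dxm0"};
}
```

Changing the communication scheme then amounts to changing the `resource` field of a channel, which is what makes mapping exploration cheap at this level.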


### **MJPEG System Architecture in Simulink**

- 7 S-Functions
- Algorithm validation, 10 frames QVGA YUV 444
- Simulation time: 15 s on a 1.73 GHz PC with 1 GB RAM





### Capture of low level architecture features in Simulink for MJPEG




Easy to experiment with different communication schemes


#### Virtual Architecture Model




#### Virtual Architecture Design

- Hardware Architecture
  - SystemC TLM, message accurate model
  - Tasks encapsulated in SC\_THREADS
  - Inter-SS communication units partially mapped on the resources
  - Abstract interconnect component
  - Intra-SS communication units become software communication channels

#### Simulation

- Task scheduled by the SystemC scheduler
- Task code validation and partitioning

[Figure: virtual architecture platform with abstract CPU-SS1, abstract CPU-SS2, and MEM subsystems; the tasks run on top of the HdS API]

#### Software Architecture

- Task C code based on HdS APIs (e.g. task code of T2)

#### Hardware platform

CPU-SS2 (SC\_MODULE):

```cpp
SC_MODULE (CPU_SS2) {
  Task_T2 *T2;      // tasks
  in_port *in;      // ports
  out_port *out;
  fifo_ch *ch1;     // channels
  // ...
};
```

Task\_T2 (SC\_THREAD):

```cpp
SC_MODULE (Task_T2) {
  // ...
  SC_CTOR (Task_T2) {
    SC_THREAD(task_T2, clk);
    // ...
  }
};
```

Communication channel:

```cpp
void recv (fifo_ch *ch, void *dst, int size) {
  dst = ch->read (size);
  // ...
}

class fifo_ch : public sc_prim_channel {
  word *buffer;
public:
  word *read (int size) {
    // ...
    for (i = 0; i < size; i++)
      *(ret + i) = *(buffer + i);
    return ret;
  }
};
```
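
The message-accurate idea behind this level can be sketched without SystemC. The plain C++ below is a stand-in, not the generated platform: `AbstractInterconnect`, `FifoChannel`, and `Message` are hypothetical names, and the counting rule (each write and each read on a channel mapped behind the interconnect counts as one message) is an assumption. A real `sc_prim_channel` would also block the reading SC\_THREAD on an empty FIFO, whereas this sketch simply returns `false`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

using Message = std::vector<std::uint8_t>;

// Abstract interconnect: no bus protocol, it only counts the messages crossing it.
struct AbstractInterconnect {
    std::size_t messages = 0;
};

// Message-level FIFO channel. When mapped on a resource behind the interconnect
// (bus != nullptr), every write and read is reported as one message (assumption).
class FifoChannel {
public:
    explicit FifoChannel(AbstractInterconnect* bus = nullptr) : bus_(bus) {}

    void write(const Message& m) {
        queue_.push_back(m);   // the whole message moves in one step
        note_transfer();
    }

    bool read(Message& out) {
        if (queue_.empty()) return false;  // a SystemC channel would block here
        out = queue_.front();
        queue_.pop_front();
        note_transfer();
        return true;
    }

    std::size_t pending() const { return queue_.size(); }

private:
    void note_transfer() {
        if (bus_ != nullptr) ++bus_->messages;
    }

    AbstractInterconnect* bus_;
    std::deque<Message> queue_;
};
```

Counting at message granularity is what keeps simulation at this level fast: the model records how many messages a mapping sends through the interconnect, not how each one is transferred.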


### **Application Example:** M-JPEG Decoder mapped on Diopsis RDT






#### Results for the M-JPEG Decoder mapped on Diopsis RDT at the Virtual Architecture Level

- 3 inter-SS communication mapping schemes, compared by total messages through AMBA and execution time

| Application | ch1_T2T3 (256 bytes) | ch2_T2T3 (4 bytes) | ch_T3T4 (64 bytes) | ch1_T1T2 to ch5_T1T2 | Total messages through AMBA | Execution time [ns] (1 clock cycle = 20 ns) |
|-------------|----------------------|--------------------|--------------------|----------------------|-----------------------------|---------------------------------------------|
| MJPEG       | DXM                  | DXM                | DXM                | SWFIFO               | 216000                      | 4464060                                     |
| MJPEG       | DXM                  | REG                | DMEM               | SWFIFO               | 144000                      | 3720060                                     |
| MJPEG       | SRAM                 | SRAM               | DMEM               | SWFIFO               | 108000                      | 2232020                                     |

Simulation time 14s (DXM+REG+DMEM), 10 frames QVGA YUV 444 format

[Figure: simulation screenshot]



Validation of task code and partitioning




#### **Transaction Accurate Architecture Model**






#### **Transaction Accurate Architecture Design**

#### Hardware Architecture

- SystemC TLM model
- Detailed CPU-SS local architecture
- Abstract CPU cores
- Explicit communication protocol
- Explicit interconnect component (bus, NoC)

#### Simulation

- Task scheduled by the OS scheduler
- Validation of OS & Comm integration

#### TA platform

[Figure: TA platform with CPU-SS1, CPU-SS2, and MEM-SS connected by an interconnect component (bus/NoC); each CPU-SS contains an abstract CPU running the SW stack on the HdS API, memory, peripherals, and an interface]

#### Software Architecture

- Task code + OS + Communication
- Based on HAL APIs

#### SW Stack code on CPU-SS2

main:

```c
extern void task_T2 ();
void _start (void) {
  create_task (task_T2);
  /* ... */
}
```

Communication SW:

```c
void recv (ch, dst, size) {
  switch (ch.protocol) {
  case FIFO:
    if (ch.state == EMPTY) _schedule ();
    /* ... */
  }
}
```

C code of Task_T2:

```c
void task_T2 () {
  /* ... */
}
```

OS:

```c
void _schedule (void) {
  /* ... */
}
```
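
The interaction between the communication software and the OS shown above (a `recv` on an empty FIFO hands control to the scheduler) can be sketched in plain C++. This is a simplified illustration, not the OS of the presentation: instead of a real context switch, each task is a function that runs until it would block and then reports `Step::Yield`. `Scheduler`, `Fifo`, `try_recv`, and `demo_producer_consumer` are hypothetical names.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <utility>
#include <vector>

enum class Step { Yield, Done };  // what a task reports when it returns to the scheduler

struct Fifo {                     // software FIFO channel (hypothetical)
    std::deque<int> data;
};

// Non-blocking counterpart of recv: returns false in the case where the
// recv excerpt would call the scheduler (FIFO empty).
bool try_recv(Fifo& ch, int& dst) {
    if (ch.data.empty()) return false;
    dst = ch.data.front();
    ch.data.pop_front();
    return true;
}

class Scheduler {
public:
    void create_task(std::function<Step()> body) {
        ready_.push_back(std::move(body));
    }

    // Round-robin over the ready tasks; returns how many times a task yielded.
    // max_steps guards against tasks that never finish.
    int run(int max_steps = 1000) {
        int yields = 0;
        for (int step = 0; step < max_steps && !ready_.empty(); ++step) {
            std::function<Step()> task = std::move(ready_.front());
            ready_.pop_front();
            if (task() == Step::Yield) {
                ready_.push_back(std::move(task));  // back of the ready queue
                ++yields;
            }
        }
        return yields;
    }

private:
    std::deque<std::function<Step()>> ready_;
};

// Usage sketch: the consumer is created first, so its first run finds the
// FIFO empty and yields to the producer.
std::vector<int> demo_producer_consumer(int count) {
    Fifo ch;
    Scheduler os;
    std::vector<int> got;
    int next = 0;
    os.create_task([&]() -> Step {
        int v = 0;
        while (try_recv(ch, v)) got.push_back(v);
        return static_cast<int>(got.size()) < count ? Step::Yield : Step::Done;
    });
    os.create_task([&]() -> Step {
        ch.data.push_back(next++);
        return next < count ? Step::Yield : Step::Done;
    });
    os.run();
    return got;
}
```

Calling `demo_producer_consumer(3)` interleaves the two tasks until the consumer has received the values 0, 1, and 2 in order. Integration issues of this kind, between scheduling and communication, are what this level is meant to validate.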



### **Application Example:** M-JPEG Decoder mapped on Diopsis RDT



- Local SS architectures detailed
- Abstract CPU execution models
- AMBA bus protocol fully modeled
- Inter-SS communication fully mapped on explicit resources
- Intra-SS communication managed by OS
- Simulation time 5m10s (DXM+REG+SRAM), 10 frames QVGA
- Validation of OS and Comm. integration



[Figure: simulation screenshot]

### Results for the M-JPEG Decoder at the Transaction Accurate Architecture Level



Communication Mapping Exploration

| Communication scheme | Transactions to DXM | Transactions to SRAM | Transactions to REG | Transactions to DMEM | Total AMBA cycles | AMBA cycles relative to DXM+DXM+DXM |
|----------------------|---------------------|----------------------|---------------------|----------------------|-------------------|-------------------------------------|
| DXM+DXM+DXM          | 5256k               | 0                    | 0                   | 0                    | 8856k             | 100%                                |
| DXM+REG+DMEM         | 4608k               | 0                    | 72k                 | 576k                 | 7884k             | 89%                                 |
| SRAM+SRAM+DMEM       | 0                   | 4680k                | 0                   | 576k                 | 3960k             | 45%                                 |
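
The table can be checked with a few lines of C++. The sketch below only re-derives numbers from the table values; the struct and function names are hypothetical. It shows that the total number of memory transactions is the same for all three schemes (5256k), so the mapping changes where the transactions go rather than how many there are, and that the percentage column is the AMBA cycle count relative to the DXM+DXM+DXM scheme.

```cpp
#include <cassert>
#include <cmath>

// One row of the table; transaction and cycle counts are in thousands.
struct SchemeResult {
    const char* scheme;
    long dxm_k;
    long sram_k;
    long reg_k;
    long dmem_k;
    long amba_cycles_k;
};

constexpr SchemeResult kResults[] = {
    {"DXM+DXM+DXM",    5256, 0,    0,  0,   8856},
    {"DXM+REG+DMEM",   4608, 0,    72, 576, 7884},
    {"SRAM+SRAM+DMEM", 0,    4680, 0,  576, 3960},
};

// Sum of the transactions to all four memories, in thousands.
long total_transactions_k(const SchemeResult& r) {
    return r.dxm_k + r.sram_k + r.reg_k + r.dmem_k;
}

// AMBA cycles of `r` as a rounded percentage of the baseline scheme.
long percent_of_baseline(const SchemeResult& r, const SchemeResult& baseline) {
    return std::lround(100.0 * static_cast<double>(r.amba_cycles_k) /
                       static_cast<double>(baseline.amba_cycles_k));
}
```

With the all-DXM scheme as baseline, `percent_of_baseline` gives 89 for DXM+REG+DMEM and 45 for SRAM+SRAM+DMEM, matching the last column.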




#### Conclusion

- Definition of the different abstraction levels and the HW & SW models
  - System Architecture (SA) in Simulink
  - Virtual Architecture (VA) in SystemC
  - Transaction Accurate Architecture (TA) in SystemC
- Structuring the SW stack into layers allows:
  - Flexibility in terms of SW components reuse (OS, Communication)
  - Portability to other platforms (HAL)
  - Incremental generation and validation of the different SW components by using SW development platforms (HW abstraction models)
- HW abstraction models:
  - VA & TA SystemC platforms are automatically generated from Simulink
  - Allow early performance estimation
  - Easily experiment several communication mapping schemes
  - Allow the efficient use of architecture resources
- Programming Environment applied to:
  - Complex heterogeneous MPSoC: RDT with AMBA, R2DT with NoC, 1AX (1 ARM, 1 XTENSA, AMBA)
  - Multimedia applications: H.264 Encoder, M-JPEG Decoder, MP3 Decoder, Vocoder

# Thank you!

